Who am I

My name is Tomasz Plata-Przechlewski and I live in Poland. I was born on 16th June 1963 (it was a Sunday, the very day Valentina Vladimirovna Tereshkova was launched into space – if you know who she is).

BTW in Poland born-on-Sunday means a work-shy (ie lazy) person (so now you know your first Polish (?) proverb).

BTW by pure statistics \(1/7 \approx 14\)% of the population is work-shy:-)

I graduated in economics a long time ago and have taught (mainly) statistics and information systems. I am a big fan of open source software (OSS) and I know a few OSS systems, including Linux and LaTeX. And of course R, which I am about to show you in a while.

My hobbies are road cycling and history. I am also an amateur photographer (cf tprzechlewski@flickr).

Agenda

Statistics (nothing spectacular, just classical EDA)

Statistical software (modern, non-standard or hipster #youcall)

Poland (via statistical examples)

Three components of Statistics

Theory (models) + Tools (programs) + Practice (real data)

Undergraduate courses in social sciences in Poland concentrate on theory, use a spreadsheet as a universal computing tool and an Office-like editor (MS Word/OO Writer) as a universal publishing tool. Students work with artificial (clean) and small data sets and are thus unaware of the problems related to applying theory to practice.

It is claimed that the above scenario is optimal. More advanced tools would be too difficult (and time consuming) for students to become acquainted with, thus distracting them from the main subject of the course, ie statistical methods.

Office software has limits. Spreadsheets are good for number crunching, but are not so good at: data cleaning (Practice), advanced graphics, spatial analysis (T), team work (Practice). Office editors and PowerPoint are great tools but are not suited to quality publishing of statistical results.

In my (humble) opinion it is completely wrong not to use some modern tools even in introductory courses, as these are (often) the only statistics lectures undergraduate students complete.

I will try to demonstrate that using modern tools for statistical analysis is the way to go, and that (some) modern tools are no more difficult than office software (at higher than basic level).

Conclusion: less theory, more practice and common sense.

Learning curve comparison

Teaching statistics to social-science students in Poland

Typically a statistics course for undergraduate students in social sciences in Poland contains: descriptive statistics for one variable, descriptive methods for the association of two variables, and elementary time series analysis (moving averages, linear trend/seasonality). Inference/probability calculus is taught marginally or omitted, as are graphical methods and information visualization.

Oversimplistic ‘value chain’ of statistical analysis

model/description -> interpretation

raw data -> consistent data -> results -> publishing

Example: Full Time Equivalence (FTE)

Number of students.

Who is a student?

A student is a person attending a 3rd-level school in the 3-stage education system (cf Educational_stage). The answer is still non-obvious as there are many forms of tertiary education. For example:

The UNESCO stated that tertiary education focuses on learning endeavors in specialized fields. It includes academic and higher vocational education.

So according to the above definition a school does not belong to tertiary education if its status is not academic and/or higher vocational. Example: Dance Academy or University for Elderly people (aka University of the 3rd Age). Both are popular in Poland.

In many countries there is some certification scheme. For example in Poland a school must apply for (and get) a certificate to be regarded as a higher-education institution (ie part of the tertiary level of education).

Heads vs Majors

A student can be enrolled in more than one course (major). So for counting heads it is necessary to remove duplicates, otherwise one would count majors, not persons.

Part time studies

FTE stands for Full-Time-Equivalent, an approximation of the number of students who would be enrolled full-time.

Full time equivalent (FTE) – FTE is based on student credit hours. It is obtained by dividing student credit hours by the number of credit hours required for full-time study.
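The three ways of counting can be compared on a toy example. This is only a sketch: the enrolment table and the full-time load of 30 credit hours are made up for illustration.

```r
# Hypothetical enrolment records: one row per student-major pair.
enrol <- data.frame(
  student = c("s1", "s1", "s2", "s3", "s4", "s4"),
  major   = c("economics", "law", "economics", "history", "law", "history"),
  credits = c(15, 15, 30, 12, 20, 20)   # credit hours taken in that major
)

full_time_load <- 30   # assumed number of credit hours for full-time study

majors <- nrow(enrol)                     # counts enrolments (majors)
heads  <- length(unique(enrol$student))   # counts persons, duplicates removed
fte    <- sum(enrol$credits) / full_time_load

c(majors = majors, heads = heads, fte = round(fte, 2))
```

The three numbers differ even for six records, which is why a table of "number of students" should always say which of them it reports.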

Conclusion: Majors, Persons or FTEs? Which is the best?

University of Utah/Office of Analysis, Assessment and Accreditation google:single multiple majors fte

Example: measurement of tourism activity [concept of an Indicator]

Who is a tourist? According to Glossary:Tourism:

Tourism means the activity of visitors taking a trip to a main destination outside their usual environment, for less than a year, for any main purpose, including business, leisure or other personal purpose, other than to be employed by a resident entity in the place visited.

According to the above definition, to be regarded as a tourist one has to change her/his place of accommodation for less than one year (otherwise Eurostat would regard her/him as a migrant).

The usual meaning (at least in Poland) is that a tourist travels for leisure, not for work. People travelling for work have other needs/aims than those travelling to rest (they usually do not use hotels, for example), so the above definition solves some problems but at the same time creates many others.

Number of tourists: does not distinguish between various forms of tourists, difficult to collect (who is a tourist anyway?)

Various ‘number of’ tourist-oriented establishments (hotels, catering units, beds, nights spent) etc. They do not measure tourists per se but are highly related and more reliable (as easier to count).

Indicator of tourist activity (by various tourist types).

Conclusion: measurement of tourism activity is not trivial. Other similar concepts: internet user, migrant, unemployed person, illiterate person.

Example: measurement of tourism activity [data collection]

Tourism supply statistics (accommodation statistics): Data on rented accommodation ie. capacity and occupancy of tourist accommodation establishments in the reporting country. How collected? Registers?

Quirks of data collection: Data up to year 2015 inclusive refer only to those units that submitted statistical reports. Starting with the data for January 2016, imputation was implemented (ie replacing missing data with some (possibly meaningful :-)) values) (cf BDL).
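To see why this change matters for comparisons over time, here is a minimal sketch. The numbers are invented, and filling gaps with the median of the reporting units is an assumption made for illustration only, not the method actually used by GUS.

```r
# Hypothetical monthly nights spent reported by five establishments;
# NA marks a unit that did not submit its report.
nights <- c(120, NA, 95, 210, NA)

# Up to 2015: only reporting units are counted.
total_reported <- sum(nights, na.rm = TRUE)

# From 2016: missing values are imputed. The median of the reporting units
# is an assumption of this sketch, not the official method.
imputed <- ifelse(is.na(nights), median(nights, na.rm = TRUE), nights)
total_imputed <- sum(imputed)

c(reported_only = total_reported, with_imputation = total_imputed)
```

The two totals describe the same month, so a jump between 2015 and 2016 may partly reflect the method rather than tourism.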

Tourism demand statistics: Data on participation in tourism of the residents of the reporting country. How collected? Surveys?

Most of the time, data on domestic and outbound trips (where “outbound tourism” means residents of a country travelling in another country) is collected via sample surveys (cf Annual data on trips of EU residents and Tourism_statistics_-_top_destinations)

Regulations concerning data collection in tourism (hundreds of pages): Glossary:Supply_side_tourism_statistics and EU regulation No 692/2011

So now we know what we are dealing with…

Example nights spent (demand side)

Share of nights spent at EU-28 tourist accommodation by tourists travelling outside their own country of residence, 2017

Country of residence -> Foreign country (estimated data)

year 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
# 10173237 9609447 10064628 10620264 11876599 12471268 12992241 13757657 15579225 16705215
##  [1] 10173237  9609447 10064628 10620264 11876599 12471268 12992241
##  [8] 13757657 15579225 16705215
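A quick way to read such a series is to turn it into year-on-year changes and a fixed-base index. The sketch below re-types the ten yearly values printed above; the variable names are mine.

```r
year   <- 2008:2017
nights <- c(10173237, 9609447, 10064628, 10620264, 11876599,
            12471268, 12992241, 13757657, 15579225, 16705215)

# Year-on-year change in percent (the first year has no predecessor).
yoy <- c(NA, round(100 * diff(nights) / head(nights, -1), 1))

# Fixed-base index, 2008 = 100.
index2008 <- round(100 * nights / nights[1], 1)

data.frame(year, nights, yoy, index2008)
```

The index makes the overall growth since 2008 visible at a glance, while the year-on-year column shows the dip in 2009.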

to be continued…

Dreadful example: Purchasing Power Parity, inflation

and international comparison of GDP

** ADD HERE **

Modern Approach

czy-ukraincy-wiaza-swoja-przyszlosc

Why universities for the elderly are booming in China yyy

Measures of central tendency, dispersion and skewness

(univariate analysis)

The CSV file hotele_caloroczne_PL.csv contains data on the number of all-season hotels in every county in Poland. First one has to load the dataset with the read.csv command:

d <- read.csv("hotele_caloroczne_PL.csv", sep = ';', header = TRUE, na.strings = "NA")

Computing measures of central tendency (with summary and/or fivenum)

summary(d)
##      teryt              powiat      hotele2012        hotele2017    
##  Min.   : 201   bielski    :  2   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:1005   brzeski    :  2   1st Qu.:  3.000   1st Qu.:  4.00  
##  Median :1636   grodziski  :  2   Median :  5.000   Median :  7.00  
##  Mean   :1721   krośnieński:  2   Mean   :  8.776   Mean   : 10.31  
##  3rd Qu.:2475   nowodworski:  2   3rd Qu.: 10.000   3rd Qu.: 11.00  
##  Max.   :3263   opolski    :  2   Max.   :158.000   Max.   :183.00  
##                 (Other)    :368   NA's   :1
fivenum(d$hotele2017)
## [1]   0   4   7  11 183

Computing mean:

mean(d$hotele2017)
## [1] 10.31053

And dispersion:

var(d$hotele2012); var(d$hotele2017)
## [1] NA
## [1] 244.8743
sd(d$hotele2012); sd(d$hotele2017)
## [1] NA
## [1] 15.64846

Second attempt (and more compact output):

c(var(d$hotele2012, na.rm=T), var(d$hotele2017, na.rm=T),
 sd(d$hotele2012, na.rm=T), sd(d$hotele2017, na.rm=T))
## [1] 183.11094 244.87430  13.53185  15.64846

BTW:

c( mean(d$hotele2012, na.rm=T), mean(d$hotele2017, na.rm=T))
## [1]  8.775726 10.310526

Or more formally: there were 8.78 hotels on average in every county in Poland in 2012, while in 2017 there were 10.31 hotels.

The interquartile range (IQR) is the distance between the lower (25%) and the upper (75%) quartile. The IQR covers the central 50% of observations. It is a robust measure of dispersion, largely unaffected by extreme values:

c( IQR(d$hotele2012, na.rm=T), IQR(d$hotele2017, na.rm=T))
## [1] 7 7
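The robustness claim is easy to check on a toy vector. The values below are made up, except for 183, which is the maximum reported by summary() above.

```r
x     <- c(2, 3, 4, 5, 6, 7, 8, 9, 10)
x_out <- c(x, 183)   # append one extreme value

# The standard deviation explodes ...
c(sd = sd(x), sd_with_outlier = sd(x_out))

# ... while the IQR barely moves.
c(IQR = IQR(x), IQR_with_outlier = IQR(x_out))
```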

Finally, we can equally easily assess the skewness.
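Base R has no skewness function, so one possible sketch is a small helper implementing the moment-based coefficient. The name skewness is my choice here (several add-on packages offer a function with the same name), and the second test vector simply reuses the five-number summary printed earlier.

```r
# Moment-based skewness: m3 / m2^(3/2), where m2 and m3 are the second
# and third central moments. NA values are dropped first.
skewness <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^(3/2)
}

skewness(c(1, 2, 3, 4, 5))      # symmetric data: 0
skewness(c(0, 4, 7, 11, 183))   # long right tail: positive
```

A positive value signals a long right tail, which is what one would expect for hotel counts (a few big cities, many small counties).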

Measures of concentration/inequality distribution (univariate analysis)

Some variables by definition are positive (or non-negative): income, market share.

sampling-distribution-of-gini-coefficient income-inequality
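For non-negative variables a common concentration measure is the Gini coefficient. A minimal sketch, assuming the plain (not bias-corrected) formula based on sorted values; gini is a helper name of my own, not a base R function.

```r
# Gini coefficient for a non-negative vector:
# G = sum((2i - n - 1) * x_(i)) / (n * sum(x)), with x sorted ascending.
gini <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini(c(5, 5, 5, 5))     # everyone has the same: no concentration
gini(c(0, 0, 0, 20))    # one unit has everything: maximal for n = 4
```

With this formula the maximum is (n - 1)/n rather than 1, which matters only for very small samples.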

Diversion: R

Diversion: Rstudio

Charts (purpose of)

Decoration,

One graph is more effective than another if its quantitative information can be decoded more quickly/easily [Robbins 2005]

Types of charts

Recommended: (ordered) dot plots, bar charts, histograms and kernel density estimates, stripcharts, multipanel displays (multiple line/dot plots instead of stacked bars), scatterplots (two variables)

Not recommended: pie charts, bubble charts, stacked bar charts

Bar/line/pie charts were introduced by William Playfair around the turn of the 19th century. Dot plots were introduced by Cleveland (1980s). Box-plots were introduced by Tukey (1970s).

Never use (pseudo) 3D charts for 2D data. Virtually no-one can read them

Pie charts

Nights spent by non residents

Dot plots

Jittering: adding random noise to data to avoid overlapping.
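In base R this is a one-liner with jitter(). The sketch below uses made-up counts with many ties; the amount of 0.2 is an arbitrary choice.

```r
set.seed(42)   # make the random noise reproducible

hotels <- c(3, 3, 3, 4, 4, 7, 7, 7, 7, 11)   # hypothetical counts with ties

# Add uniform noise of at most +/- 0.2 to every value.
hotels_jit <- jitter(hotels, amount = 0.2)

# Tied values no longer coincide, so the points can be told apart on a plot.
rbind(original = hotels, jittered = round(hotels_jit, 2))
```

The noise should stay small relative to the spacing of the data, otherwise the plot starts to misrepresent the values.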

Bar charts

What, when and where

Before we continue with statistical graphics, a short 2-slide diversion on geocode standards used in statistics.

No doubt in every reliable survey the population has to be precisely defined, ie 3 dimensions of every surveyed unit should be fixed: definition (what), time (when measured), space (where)…

I always repeat to my students: if you look at some data (in the media for example), start by establishing whether you know what, when and where. If no information (or reliable link–called source–to information) is provided on any of the fixed dimensions of data, treat this data as rubbish and do not waste time using/analysing it.

Further dissemination of such defective data should be subject to public prosecution (joke).

I have already tried to show you that the what dimension is complicated and often highly unreliable/arbitrary (due to the nature of the phenomenon and/or measurement difficulties).

The when dimension is much simpler thanks to a universal standard, ie time. You gather data either for a certain moment (how many hotels were in use on 31st December 2018) or for a certain period of time (how many beds were sold in these hotels in the 3rd quarter of 2018).

The where dimension in turn is usually based on administrative or statistical (geographical) units (country, state/province, county, community). But contrary to the time dimension there is no universal or globally accepted standard for geostatistical units. Usually such a standard is based on the administrative system, which is country-dependent.

The administrative division of Poland since 1999 has been based on three levels of subdivision (cf Administrative divisions of Poland). Since Poland became a member of the European Union in 2004, EU regulations have been part of the national legal system.

EU regulates everything, statistics included.

Conclusion: The pigs had to expend enormous labours every day upon mysterious things called “files,” “reports,” “minutes,” and “memoranda.” These were large sheets of paper which had to be closely covered with writing, and as soon as they were so covered, they were burnt in the furnace (George Orwell, Animal Farm)

NUTS and TERYT

The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)

NUTS standard was revised several times (on the average every 4 years :-)), so there is even a page at ec.europa.eu domain dedicated to NUTS (short) history (cf NUTS history)

NUTS1 (level) – macroregion, NUTS2 – state, NUTS3 – county. We would like to plot a chart showing number of hotels.

Poland is divided into 16 states (NUTS2) and 380 counties (NUTS3), which are equal to administrative units. So on average there are 23.75 counties per state. The NUTS1 level is for statistical purposes only (but the regions are in fact distinct due to history, economics, natural conditions, cultural factors etc…).

There is a relevant and interesting page by GUS (Main Statistical Office or Główny Urząd Statystyczny), but unfortunately in Polish (use Google Translate :-) in case you are interested, or mail me) (cf Klasyfikacja NUTS w Polsce).

The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW province in Polish is “prowincja” (both words come from Latin), but the Polish administrative province is actually called “województwo”, from “wodzić” – ie commanding (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces (every province ruled by a “wojewoda”, ie the chief of that province). More can be found at Wikipedia (cf Administrative divisions of Poland).

NUTS3 consists of 380 counties (called “powiat”). In old Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, as 600 years ago :-)

There is no NUTS4 level but there is a 3rd level of Polish administration used by GUS (Main Statistical Office). This 3rd level is called “gmina” (community).

There are (approximately) 2750 communities in Poland. As Poland's population is 38.5 mln and its area equals 312.7 thousand sq km (about 123 persons per sq km), on average each powiat has about 820 sq km and each community about 114 sq km, or approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.
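The averages quoted above follow from simple division; here is the arithmetic spelled out, using the rounded figures from the text (the variable names are mine).

```r
population <- 38.5e6    # persons
area       <- 312.7e3   # sq km
n_powiat   <- 380
n_gmina    <- 2750      # approximate, as in the text

avg <- c(
  persons_per_sqkm   = population / area,
  sqkm_per_powiat    = area / n_powiat,
  sqkm_per_gmina     = area / n_gmina,
  persons_per_powiat = population / n_powiat,
  persons_per_gmina  = population / n_gmina
)
round(avg, 1)
```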

TERYT is a Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes identification of administrative units. Every unit has an id number of up to 7 digits: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and “t” encodes the type of community (rural, municipal or mixed). Higher-level units have trailing zeros for the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a community (approx 2750 units).
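The wwppggt layout can be unpacked with a few substr() calls. This is a sketch only: parse_teryt is a hypothetical helper of mine, and it assumes ids are passed as character strings so that leading zeros survive.

```r
# Split TERYT ids (character strings) into their components.
# Shorter ids are padded with trailing zeros to the full 7 digits.
parse_teryt <- function(id) {
  id   <- as.character(id)
  full <- paste0(id, strrep("0", 7 - nchar(id)))
  data.frame(
    teryt = full,
    ww    = substr(full, 1, 2),   # województwo
    pp    = substr(full, 3, 4),   # powiat
    gg    = substr(full, 5, 6),   # gmina
    t     = substr(full, 7, 7),   # type of community
    stringsAsFactors = FALSE
  )
}

parse_teryt(c("14", "1205", "1205000"))
```

Note that ids read from a CSV as numbers lose their leading zero (the summary() output earlier shows 201, presumably 0201), so such ids would need left-padding before being parsed this way.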

So you are now experts on administrative division of Poland, and we can go back to statistical charts…

Strip charts

A strip chart (strip plot) shows the distribution of data points along a numerical axis. These plots are a suitable alternative to box plots when sample sizes are small (because they preserve more information about the data).

Histograms and kernel density functions

Histograms show the distribution of a set of data. To draw a histogram the numbers (observations) are grouped into bins (intervals or classes). There is a tradeoff between showing details and showing the overall picture. When the bin width changes, the scale of the Y-axis changes as well (more bins, fewer observations in each bin).

library(ggplot2)
# number of bins chosen with Sturges' rule
ggplot(d, aes(x = hotele2017)) +
  geom_histogram(bins = nclass.Sturges(d$hotele2017))

Histograms with binwidth equal to 20,10, 5 and 1 respectively:
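The effect of bin width on the counts can also be checked without drawing anything. A small sketch with made-up data; bin_counts is a helper of my own wrapped around base R's hist().

```r
# Hypothetical data: 20 observations between 0 and 19.
x <- c(0, 1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 8, 9, 10, 11, 12, 15, 19)

# Count observations per bin for a given bin width (bins cover 0..20).
bin_counts <- function(x, width) {
  breaks <- seq(0, 20, by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

bin_counts(x, 20)   # one bin: overall picture, no detail
bin_counts(x, 10)
bin_counts(x, 5)    # more bins, fewer observations in each
```

The total always stays at 20; only the way it is split across bins (and hence the height of the tallest bar) changes.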

Kernel density functions

ggplot(data=d) + geom_density(aes(x=hotele2017))

library(ggpubr)  # ggarrange() is assumed to come from the ggpubr package
# the same density estimated with increasing bandwidth multipliers
p1 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=0.25)
p2 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=1.0)
p3 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=2.0)
p4 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=8.0)
ggarrange(p1,p2,p3,p4)

Comparing distributions box-plots vs multiple histograms

Box-plots are much better than histograms for comparing distributions of more than one data set.

Construction of a (typical) box-plot: the middle bar is the median. The top/bottom edges of the rectangle are the 3rd and 1st quartiles, so the height of the rectangle is the IQR (interquartile range). The fanciful bars above/below the rectangle, called whiskers (google: whiskers mustache :-)), extend up to 1.5 times the IQR from the rectangle (or to the minimum/maximum if those values lie within 1.5 IQR of the rectangle). The symbols above/below the whiskers (usually open circles) are outliers (non-typical/extreme values).

Note the trick: outliers are defined not as (for example) the top/bottom 1% of values (every distribution would have outliers in such a case) but as values lower than Q1 − 1.5 IQR or higher than Q3 + 1.5 IQR (so distributions with moderate variability may have no outliers at all).
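The 1.5 IQR rule can be reproduced by hand. A sketch on a made-up vector; boxplot.stats() is the base R function that box-plots rely on, shown here for comparison.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Values beyond these fences are drawn as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers

# boxplot.stats() applies the same rule (it uses Tukey's hinges, which can
# differ slightly from quantile() in small samples).
boxplot.stats(x)$out
```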

Example: age of Nobel-prize winners (cf The Nobel Prize API Developer Hub)

d <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header = TRUE, na.strings = "NA")

ggplot(d, aes(x=category, y=age, fill=category)) + geom_boxplot() + ylab("years") + xlab("")
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).

Multiple histograms are too detailed (binwidth=5). It is impossible, for example, to establish which category has the youngest laureates (on average), or which has the oldest (economics and literature are candidates, but due to the multimodality of the literature laureates' distribution it is difficult to say for sure…)

ggplot(d, aes(x=age, fill=category)) + geom_histogram(binwidth=5) +
    facet_grid(category ~ .)
## Warning: Removed 39 rows containing non-finite values (stat_bin).

Scatter-plots

A scatter-plot (aka scatter diagram, xyplot) is the basic chart for two (quantitative) variables.

To see the relationship between the variables, a line can be fitted. The least squares (LS) line, which assumes a linear relationship between the variables, is fitted by minimizing the sum of squares of the residuals (a residual is the difference between a data point and the corresponding point on the line, ie a point computed from the formula y = a + bx, where x is the value of the x-axis variable).

Alternatively a loess curve can be used, which does not assume linearity.
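Both fits are available in base R. The sketch below uses simulated data (the true line y = 2 + 0.5x and the noise are my assumptions) and also checks that lm() agrees with the textbook formulas for a and b.

```r
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)   # simulated data around a straight line

# Least squares line y = a + b x
ls_fit <- lm(y ~ x)
coef(ls_fit)

# The same coefficients from the closed-form formulas
b <- cov(x, y) / var(x)
a <- mean(y) - b * mean(x)
c(a = a, b = b)

# Loess curve: a local fit that does not assume linearity
lo_fit <- loess(y ~ x)
smooth <- predict(lo_fit)   # fitted values at the observed x
```

In ggplot2 the two correspond to geom_smooth(method = "lm") and geom_smooth(method = "loess").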

Scales

A logarithmic scale makes it possible to plot values whose range is too wide for a linear scale. Base 10 logarithms ‘squeeze’ the numbers more than base 2 logarithms (log10(100) = 2 while log2(100) = 6.64). Moreover, if the original scale contains powers of 10, use log10 to get a ‘nice’ log scale, while if it contains powers of 2, use log2.

On a logarithmic axis equal distances correspond to equal ratios rather than equal differences, ie an additive scale becomes a ‘multiplicative’ one. Example (Nobel prize again):

dA <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header = TRUE, na.strings = "NA")
nrow(dA)
## [1] 934
dS <- subset(dA, bornCountryCode != "") ## keep rows with a known country of birth
nrow(dS) ## how many
## [1] 901

aggregate by bornCountryCode

Finally plot the resulting data using various Y-axis scales (arithmetic, log2 and log10)
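The aggregation and rescaling steps might look roughly as follows. This is a sketch on a tiny made-up data frame that only mimics the bornCountryCode column; the counts are invented, not taken from the Nobel data.

```r
# Hypothetical stand-in for dS: one row per laureate.
dS <- data.frame(
  bornCountryCode = rep(c("US", "DE", "PL", "JP"), times = c(8, 4, 2, 1))
)

# Aggregate: number of laureates per country of birth.
counts <- as.data.frame(table(dS$bornCountryCode), stringsAsFactors = FALSE)
names(counts) <- c("bornCountryCode", "n")

# The same counts on log2 and log10 scales.
counts$log2_n  <- log2(counts$n)
counts$log10_n <- log10(counts$n)

counts[order(-counts$n), ]
```

On the log2 scale each doubling (1, 2, 4, 8) is one step (0, 1, 2, 3), which is the ‘multiplicative’ reading of the axis. In ggplot2 the axis itself can be switched with scale_y_continuous(trans = "log2") or scale_y_log10().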

Graphic perception tasks

Position along a common scale

Position along common but nonaligned scales

Length

Angle (slope)

Area

Volume

Color (hue), color (saturation), color (density of black)

Angle judgement is not precise. Acute angles are underestimated while obtuse angles (greater than 90) are overestimated.

Area judgement is biased as well. It is impossible to distinguish small differences in area, while it is quite easy when the same data is plotted along a common scale.

The most accurate graphic perception task is judging position along a common scale.

General design rules

To visualize n-dimensional data do not use more dimensions than n.

Always include 0 in numerical axes

** ADD **

Banking to 45

The ratio between the width and the height of a rectangle is called its aspect ratio.

The aspect ratio describes the area that is occupied by the data in the chart. A change in aspect ratio changes the perception of the graph. The question is which aspect ratio is the best.

We can recognize change most easily if the absolute slopes are close to a 45 degree angle on the graph. It is much harder to see change if the curves are nearly horizontal/vertical. The idea (Cleveland, 1988) behind banking is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at an approximate 45 degree angle.

Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.

Setting the aspect ratio so that the weighted mean of the orientations of the line segments (weighted by the segments’ length) is approx 45 degrees is called the average weighted orientation method (to 45 degrees).
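The unweighted variant can be sketched in a few lines. This is my own minimal reading of the method, not a reference implementation: bank_to_45 is a hypothetical helper, flat segments are ignored, and the result is the height/width ratio of the plotting area.

```r
# Aspect ratio (height / width) that banks the average orientation of the
# line segments of a series to 45 degrees.
bank_to_45 <- function(x, y) {
  # Segment slopes after rescaling both axes to unit length.
  dx <- diff(x) / diff(range(x))
  dy <- diff(y) / diff(range(y))
  s  <- abs(dy / dx)
  s  <- s[s > 0]   # assumption: flat segments are ignored

  # Find alpha such that the mean orientation atan(alpha * s) is 45 degrees.
  f <- function(alpha) mean(atan(alpha * s)) - pi / 4
  uniroot(f, interval = c(1e-6, 1e6), tol = 1e-10)$root
}

x <- 1:8
y <- c(2, 3, 5, 4, 6, 9, 8, 10)   # made-up series
alpha <- bank_to_45(x, y)
alpha
```

A value below 1 means the plot should be wider than it is tall. In ggplot2 such a ratio could be applied with theme(aspect.ratio = alpha).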

*** Example ***

Diversion: R (making charts)

The lie factor and data-ink ratio [Tufte]

Example: education system in Poland

Elementary spatial analysis (Heat maps/thematic maps)

Geocoding and reverse geocoding

Diversion: tools for geocoding and reverse geocoding

Diversion: tools for building heat/thematic maps

QGis

Example: Poland (population, incomes, distribution of)

More examples of spatial charts

Summary: bad graphics examples

Bivariate analysis

Example: tourist vs industry vs education (in Poland)

Example:

Timeseries analysis

Bad example: tourists at Malbork castle

The determinants of the tourist traffic in the castle’s museum of Malbork

Elementary spatial analysis

Example: industry concentration (in Poland)

Information presentation by Edward Tufte

Tufte’s famous slide

Resources

cheatsheets QGIS tutorials gis.stackexchange.com

Data banks

Tourism cd

nui

Kaggle (Coffee production/consumption)

icos-crop-data exploring-coffee-production-and-consumption ico-coffee-crop-data-data-wrangling

Varia

hours-worked

Reproducible research

So you probably still wonder why I am punishing myself by using such an odd system. The most important argument, which I will present in a moment, concerns the basic approach (philosophy, if one has to be pompous) to doing statistical analysis.

This mode (or concept) is called Reproducible Research (RR in short).

Serious statistical analysis is not a one-off job. There is a value chain as well as a life cycle of statistical analysis. Value chain means that there are distinct stages, while life cycle means that the same data/models are used for years and most statistical analyses do not start from scratch but are based on data from the past augmented with new data. The problem is that the new data and model modifications should be in sync with the past.

To make the problem worse, serious statistics should also be in sync with the work of others (to ease, or make possible, any meaningful (international) comparisons, for example).

Diversion: Github

New Tools (hipster part)

R/Rstudio for computing and data visualization

Github for enhancing team work

markdown for reproducible research

New practice [recap]

More Data banks

UK immigration-statistics

Example educational resources

https://git.generalassemb.ly/briancwq/classes/blob/master/week-01/lessons/python-descriptive_statistics_numpy-lesson-master/archive/LESSON.md

ukraine-deputies

Questions

Thanks